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Abstract 

Systolic arrays for determining the Singular Value Decomposition of 
a mxn, m > n, matrix A of bandwidth w are presented. After A has been 
reduced to bidiagonal form B by means of Givens plane rotations, the 

singular values of B are computed by the Golub— Reinsch iteration. The 
products of plane rotations form the matrices of left and right singular 

vectors. 

Assuming each processor can compute or apply a plane rotation, 0(wn) 
processors accomplish the reduction to bidiagonal form in 0(np) steps, 

where p is the number of superdiagonals. A constant number of processors 
can then determine each singular value in about 6n steps. The singular 

vectors are computed by rerouting the rotations through the arrays used for 

the reduction to bidiagonal form, or else "along the way" by employing another 
rectangular array of 0(wm) processors. 


Research was supported in part by the Office of Naval Research Contract 
N000014-82-K-0184 and in part by the National Aeronautics and Space 
Administration under NASA Contract No. NAS1-17070 while the author was in 
residence at the Institute for Computer Applications in Science and 
Engineering, NASA Langley Research Center, Hampton, VA 23665. 
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Introduction 


A singular value decomposition (SVD) of a m*n matrix A, m > n, 
consists of a factorization 

A « U S V T , 

where U and V are orthogonal matrices and E is a nonnegative 

diagonal matrix. 

The SVD is an indispensible tool for rank determination and 
handling of rank deficiencies, a more detailed account of its 
mathematical properties can be found in [6]. As for its applications in 
signal processing, Speiser and Whitehouse have presented its usefulness 
in adaptive beamforming and data compression [13]. 

With the advent of VLSI technology it seems now feasible to perform 
a SVD in real-time. A number of papers have recently dealt with 
algorithms for SVD of dense matrices amenable to implementation on 

o 

systolic arrays. An O(n^)-processor singular value array of [4] relies 
on a version of Hestenes' method where intermediate quantities are not 
completely updated; a formal convergence proof is not provided. Brent 
and Luk [1] construct an array for a one-sided orthogonalisation method 
due to Hestenes which uses a linearly connected mesh of 0(n) 
processors and 0(mn) steps to determine a singular value or else a 
two-dimensional 0(mn) processor array with a non-planar inter- 
connection structure and time 0(n log m). The method is quadratically 
convergent; experience suggests that 6 to 10 iterations per singular 
value provide for sufficient numerical accuracy. In [2] a similar 
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architecture of O(n^) processors implements a cyclic Jacobi method for 
the SVD in 0(m + n log n) steps. With the addition of QR and matrix 
multiplication arrays, the generalised SVD for matrices A e R mXn and 
B e RP* n is computed in 0(£ log n) steps on 0(£^) processors, 

Z > m,n,p [3]. With reference to the previous works Schreiber [12] 
suggests a method to cope with problems that do not match the array 
size. In [11] he proposes a kn-processor design which reduces a dense 
matrix to upper triangular form of bandwidth k+1 in time 0(mn/k). A 
k(k+l)-processor array from [8] is used to implement a SVD iteration for 
(k+l)-diagonal matrices (analogous to the Golub-Reinsch iteration for 
bidiagonal matrices) in time 6n + 0(k). For k = 0(n*^) this amounts 
to processor and hardware requirements of 0(n^/^). 

The systolic designs to be discussed are based on those in [8] and 
are thoroughly specified in [10]. They compute the singular values of a 
banded matrix A by first reducing A to bidiagonal form B and then 
computing the singular values of B. A VLSI implementation for the same 
problem was already proposed in [7]; this paper, however, substantially 
improves the bandwidth reduction array: it has become much simpler and 
is now also able to deal with problems not matching the hardware 
dimensions. The matrices of left and right singular vectors, U and 
V, are generated by accumulating the products of plane rotations. 

After a description of the SVD algorithm in the next section, a 
review of the systolic designs in [8] will be given. It is followed by 
a discussion of systolic arrays for bandwidth reduction and singular 
value computation for bidlagonal matrices. 
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2. Singular Value Decomposition 

The singular value decomposition (SVD) of a matrix A e R mXn , 
m > n, is 

U t AV = diag(o 1 ,...,o n ), 

where U e R mXm and V e R nXn are orthogonal matrices and 
diagCo^ , . . . ,0^) is a nonnegative diagonal matrix. The columns of U(V) 
constitute the left (right) singular vectors of A, and a ^ are the 
singular values. 

The eigenvalues of A^A are the squares of the singular values of 
A: 

V T (A T A)V = diag(Oj,...,o^). 

Hence, computation of the singular values of A can be performed by 

T 

computing the eigenvalues of A A. 

The QR algorithm for computing the eigenvalues of a symmetric 

matrix A e R nXn is based on the decomposition of A into an 

orthogonal (Q“* = Q^) matrix Q and an upper triangular matrix R, 

A = QR. 

It is most efficient when the original matrix A is first reduced to 
tridiagonal form T (tfj = 0 for i < j-1 or i > j+1). With Tq = T 

an iteration of the QR method takes the form 
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T i “ s i+l I Qi+l R i+l» T i+1 “ R i+lQi+l + Sf+l 1 
or, 

T i+1 = Q i+l T i Q i+r 

If the scalar s^ + j is chosen as the eigenvalue of the trailing 2x2 

submatrix closest to t^ ^ then this element will converge to an 

eigenvalue of A at least quadratically. Once t^^ . is "close" to 

n,n— 1 

zero t nn^ is a good eigenvalue approximation. In that case, is 

deflated (its trailing row and column are disregarded) and the procedure 
is repeated to find the next eigenvalue. 

Having formulated the SVD as an eigenvalue problem a straight- 
forward approach for its computation would be to first reduce the matrix 
A to bidiagonal form B (b^ =0 for i > j or i < j-1), and then to 
compute the eigenvalues of T = B^B via 

V T TV = diag(Oj,...,aj;). 

Then V is the matrix of right singular vectors and the left singular 
vectors are obtained from the QR decomposition of AV. Yet, the 

m 

explicit formation of B B squares the condition number of the problem 
and numerically results in loss of information. 

Golub and Reinsch present an alternative method which avoids the 
explicit formation of the matrix product [5], If T has nonzero 
offdiagonal elements, Qj is an orthogonal matrix and 
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T 1 “ Q 1 TQ 1’ 

it can be shown, e.g. [6], that Qj and Tj are uniquely determined by 
the first column of Qj. Furthermore, if P is an orthogonal matrix 
with the same first column as and P TP is reduced to tridiagonal 

form 

t 2 = (pq 2 ) t t(pq 2 ) , 

then |Tj| = | T 2 1 • Consequently, the singular values of B can be 

determined by iterations of the following form (Bq = B) , 

Let ?±+i be orthogonal with the same first column as Q±+\ 

Compute C i+1 " B i p i+1 

Reduce B^P^+i to bidiagonal form B^ + j. 

The effect of the shift s i+i is now concentrated in the matrix P^ + j 
and the above procedure is known as the "implicitly shifted" version of 
the QR method. Deflation takes place as for eigenvalue computations. 
Hence, the singular values of A are computed in two steps: 

(1) Reduce A to bidiagonal form B 

(2) Perform the above Iterations on B until the required 
singular values are found. 
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3. Systolic Arrays for Givens Rotations 

The purpose of an orthogonal matrix in this context is to stably 
reduce a matrix to a certain structured form by selectively introducing 
zero elements. It was shown in [8] that a VLSI implementation of Givens 
plane rotations would, unlike Householder transformations for example, 
require only nearest neighbour data communication. Thus, all orthogonal 
matrices will be products of plane rotations, where each rotation zeroes 
a single matrix element as follows: 



where 

if y^ = 0 then x' = x^, c = 1, s = 0, 

else x' = /x 2 l+ y 2 c = Xj/x', s = yj/x' » ( pl ) 

x i = cx i + sy i» y i “ " sx i + cy i» 2 < i < k, (P2) 

It is assumed that each, equation (PI) and (P2), takes one "time step". 
Two kinds of processors are needed to realise rotations, one to execute 
(PI) and another for (P2), see Figure 1. 
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Given a matrix A e R m * n , m > n, its diagonal consists of the 
elements a^, 1 < i < n, the : kth subdiagonal of a k+i i> 1 < i < m-k, 
and the kth superdiagonal of a i } R+i > 1 < i < n-k. In the sequel, 

only band matrices of bandwidth w = p + q + 1 will be considered where 
subdiagonal q > 0 is the leftmost nonzero subdiagonal and super- 
diagonal p > 1 is the rightmost nonzero superdiagonal. Consequently, 
either m=norm=n+q. 

A QR decomposition of a matrix A does not increase the bandwidth; 
for each eliminated subdiagonal a new nonzero superdiagonal is filled 
in, so R has bandwidth w with p+q superdiagonals. 

For an overview on systolic arrays for the QR factorization the 
reader is referred to [8]. There a linearly connected mesh of w 
processors, call it Lq, removes the qth, outermost, subdiagonal of A: 
each of its elements is removed by an appropriate rotation, (PI), which 
is then applied to the corresponding pair of rows, (P2). The product of 
these rotations forms an orthogonal matrix and 

R x = QjA, 

where Rj has p + 1 superdiagonals. The first, leftmost, processor 
of the linear mesh repeatedly computes (PI) while the w - 1 succeeding 
processors to its right each compute (P2), see Figure 2. The matrix A 
is input by diagonals, the qth subdiagonal enters processor 1, the' 
diagonal processor q + 1 and the pth superdiagonal enters the right- 
most, wth, processor. On output these diagonals are- shifted one place 


T 
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to the left, the (q-l)st subdiagonal exits from processor 1, the 
diagonal from processor q and the (p+l)st superdiagonal from processor 
w. The rotations flow to the right until they leave processor w. The 
time from the first input to the last output is 2n steps if m = n and 
2(n+q-l) if m > n. 

The QR decomposition of a band matrix with q subdiagonals, 
accomplished by routing the matrix through q successive Lq-meshes, 

takes 2(n+q-l) steps. The ith mesh encountered removes the (q-i+l)st 
subdiagonal. 

Elimination of a superdiagonal is achieved by multiplying from the 
right by an orthogonal matrix, which is the product of rotations to be 
applied to the columns of A, so 

L 2 ' 

where L2 has q + 1 subdiagonals. The data lines of the processors 
are reversed and the corresponding linear mesh, call it L^, is a 
"mirror image" of the one for subdiagonal elimination. The processor 
computing (PI) is now the rightmost, wth processor, and it sends the 
rotations travelling towards the left. The reduction to lower 

Hessenberg form (in the upper triangular part only the first super- 
diagonal is nonzero) of a square matrix A is accomplished by sending 
A through p - 1 successive L^-meshes; this takes 2(n+p-2) time 
steps. 

In general, k subdiagonals (superdiagonals) of A are eliminated 
in 2(n+k-l) steps by sending A through k successive Lq- (Ljj-) 


meshes 



In the sections to come, the systolic arrays will be described in 
terras of the linear meshes, Ljj and Lq, eliminating superdiagonals and 
subdiagonals, respectively, - rather than processors eliminating 
individual matrix elements. The subscript Q (from QR) will be 
connected with removal of subdiagonals, while H (from lower Hessenberg 
form) is associated with elimination of superdiagonals. 


4. Systolic Arrays for Reduction to Bidiagonal Form 

To keep hardware to a minimmum, an algorithm for bandwidth 
reduction will be chosen that does not increase the number of nonzeroes 
per row (even temporarily). 

The algorithm for a reduction to bidiagonal form of a banded matrix 
A to be presented here basically proceeds in two stages, removal of q 
subdiagonals followed by removal of p-1 superdiagonals (since m > n, 
subdiagonal removal decreases the order of the matrix, as well as the 
computation time and should be performed first). It will be assumed 
that the bandwidth of A does not exceed the size of the Lq- or Ljj- 
meshes, otherwise partitioning strategies have to be applied, see [9]. 

As for the first stage, 

Compute the QR decomposition of Aq = A, Rj ■ U^Aq 
R emove the filled-in superdiagonals, = RjVj 
L et Aj consist of the first n rows of Hj 
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For i = 1 . . . [n/p] 

T 

Compute the QR decomposition R^ + ^ = U i+l^i 
Remove the filled-in superdiagonals, #±+i ~ %+l v i 
Let he #±+i without its leading p rows and 

columns. 

The first two steps essentially reduce the problem size from mxn to 
nxn, i.e., the QR decomposition causes the last q rows of R^ to 
become zero. During the loop, the QR decomposition, which eliminates 
the zeroes in the lower triangular part, is followed by removal of the 
filled-in superdiagonals. This will restore the previously eliminated 
subdiagonals - save their p leading elements. Thus, disregarding the 
leading p rows and columns the whole process is repeated on the 

remaining matrix until the subdiagonal part has totally disappeared. 
After [n/p] such iterations, 

T T 

A [n/p]+l " U [n/p]+l **• U 1 A V 1 V [n/p]+l 

is an upper triangular matrix ([x] denotes the smallest integer equal to 
or greater than x). On an array with q Lq-meshes followed by q L H - 
meshes, all meshes being of size w, stage 1 takes time proportional to 
2n^/p. If on the order of n Lq-meshes are available so that 

uninterrupted pipelining is possible, the computation time comes to 
about 4nq/p steps* 
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During the second stage, n-2 iterations are necessary to reduce 
the upper triangular matrix A [n/p]+l t0 bidiagonal form B. The 
iterations are similar to the ones above. 



The corresponding array comprising p-1 L^-meshes succeeded by p-1 Lq- 

o 

meshes, each of size p+1, completes stage 2 in 2n^+0(np) steps. 

The example in Figure 3 illustrates several steps in the reduction 
to bidiagonal form of a matrix with three subdiagonals and two super- 


diagonals. Given an array for stage 1 with 2wq processors and one for 
stage 2 with 2(p^-l) processors the reduction to bidiagonal form takes 
2n 4 +0(n^/p) steps. If 0(n) such arrays are available the time 


reduces to 4np + 0(nq/p). 


For the computation of the singular vectors, the rotations forming 
T 

the are rerouted through the Lq-meshes and applied to a mxm 

identity matrix, while rotations forming the are input into Lg 

meshes to be applied to an nxn identity matrix. Since U and V are 
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gene rally full matrices, they have to be determined by inputting sub- 
matrices of order w/2 [7] that "fit into" the Lq- and L^-meshes. 

Alternatively, another rectangular 2(m-l)w processor array as in [11] 
may be employed to which rotations are input as soon as they have left 
the Lq- and L H ~meshes; the computation of the singular values and 
vectors can thus proceed concurrently. 


5. A More Flexible Array for Reduction to Bidiagonal Form 

Instead of having different arrays for stages 1 and 2, essentially 
one array can be shared by both of them. It consists of two separate 

parts, one succession of max(p,q) Lq-meshes of size w and another 

one containing the same number of L^-meshes of the same size. During an 

iteration of stage 1 the matrix is first entered into the Lq-part and 

thereafter into the Ljj-part. This order is reversed in stage 2. 

But now the size of the meshes may be wider than the actual 
bandwidth of the matrix. Yet, the matrix must be entered "leftbound" 
into the Lq-part and "rightbound" into the L^-part, so that the doomed 
sub- or superdiagonal enters the processor computing (PI). Hence, 
before entering the Lg-part the matrix may have to be aligned to the 
right, and possibly to the left before input to the Lq-part. Three 

different cases can occur. 

If p - 1 = q, no alignment is necessary in stage 1, since the 
bandwidth is equal to the size of the meshes, while the number of meshes 
corresponds to the number of subdiagonals to be removed. During stage 



-13- 


2, the matrix has to be shifted by q places during each transition 
between Lq- and L R -parts and vice versa. 

If p - 1 < q, no alignment occurs during stage 1. During stage 2, 
though, q - (p-1) meshes are idle, i.e. , they generate only identity 
rotations. Consider the reduction to Hessenberg form. After having 
traversed the first p - 1 L R -meshes, the first superdiagonal is in the 
wth processor, the reduction is completed. However, the remaining 
q - (p-1) meshes shift the matrix further to the right, each by one 
place, so it is "squeezed" out of the array to the right. To properly 
enter the Lq-part it therefore has to be shifted to the left - by the 
distance it was squeezed out, which is q - (p-1), plus the distance 
between the outermost, (p-l)st, subdiagonal and the first processor, 
i.e., w - (p+1). Thus, after leaving the L R -part the matrix must be 
shifted w + q — 2p places to the left before entering the Lq— part. It 
must be shifted the same distance to the right after output from the Lq- 
part. Figure 4 illustrates this situation. 

If q < p - 1, consider the QR decomposition in the first stage. 
After the first q Lq-meshes have been traversed, the remaining 
p - 1 - q meshes will squeeze the matrix out to the left, one place per 
mesh. To enter the Ljj-part, the matrix is shifted to the right - by 
p - q - 1 places, equalling the distance by which it was squeezed 
out. During the second stage shifting occurs by q places. 

Moreover, in [10] it is shown that, by slight reprogramming of the 
processors, one type of mesh can fulfill the functions of both Lq- and 
L R -meshes. A reduction to bidiagonal form of a matrix with bandwidth w 


i 
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is then performed on k meshes of size at least w. The alignment can 
be limited to k places if every kth processor in a mesh is a (PI) 
processor. 


6. A Systolic Array for Computation of the Singular Values 

The processor which implements the singular value computation for 
bidiagonal matrices is a special case of the one for eigenvalue 
computation [7]. It executes one step of the Golub-Reinsch iteration 
[5], 

T 

Let be a rotation removing the (2,1) element of 

Compute C i+1 = BjP i+ i 

Compute B i +1 by reducing to bidiagonal form. 

Notice that C^ + ^ differs from only in the first two rows and 

columns. One can assume, that it is computed separately and then passed 
through a network, built around the processor [7,10] depicted in Figure 
5, which might be viewed as a conglomerate of processors computing (PI) 
and (P2): 

Step 1 : values already in cell are r^ , sj, r 2 > 83 

a) input S 3 

b) generate P so that P 




c) 


compute 




0 

s 3 


d) output r', P and retain r', s', r', s 3 


Step 2 : 

e) input t2, t3 

f) generate Q so that (r^, r^Q = ( r 2> 


g) 


compute 





h) output t', Q and retain (s£, t', s'j, t') 
as (rj, S|, r 2 , s^) for the next operation 
of the cell. 


For the singular value computation this means that in every two steps 
the processor computes one pass through the following loop, which 
generates B 1+1 (C i+1>2 = c i+l)» 

For j ** 2 ••• n-1 

Generate a rotation P? , , to annihilate element 

J 

of C i+l,j 

Apply it, generating a fill-in at position (j~l,j+l) of 
P j,j-l C i+l,j 

Generate a rotation Pj-ijj+i t0 annihilate position 
Apply it, generating a fill-in at position (j+l,j) of 
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T 

C i+i > j+i = (p j,j-i c i+i,j )p j-i,j+r 

The total time to generate B i+1 = C i+1>n is 2n + 2; the corresponding 
network is shown in Figure 6. On the average 2 to 3 iterations are 
required per singular value, bringing its computation time to 6n 
steps. Note that the input t£ is zero except in the first step. 

To complete the computation of the singular values, the rotations 
generated above are applied to the partially computed singular vector 
matrices from the bandwidth reduction step. 

The incorporation of the preceeding designs into a system computing 
all the singular values is described in [7]. One remaining problem is 
the efficient computation of the shift values for convergence 
acceleration. Presently, values taken from the trailing end of the 
matrix have to be incorporated into an orthogonal rotation which is 
applied to the leading rows. Therefore it is not obvious how to 
pipeline several singular value iterations while at the same time 
maintaining quadratic (and in practice cubic) convergence. 
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Figure 1. Processors for Elimination of Subdiagonals 
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Matrix elements are represented by their subscripts. 


Figure 2 


Lg-Mesh for Subdiagonal Elimination of 
Size w = 5 and Input Matrix with p *■ q ** 2 
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(a) Stage 1: QR Decomposition Followed by Removal of Superdiagonals. 
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(b) Stage 2: Reduction to Lower Hessenberg Form Followed by Removal of Subdiagonals. 


Figure 3. Several Steps in the Reduction to Bidiagonal Form 
of a Matrix with m = 11, n ** 8, p = 2, q = 3 
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Stage 1 Stage 2 

Figure 4. Input/Output Format and Alignment for Systolic Arrays 

Performing a Reduction to Bidiagonal Form in Case p-1 < q 
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Figure 5. Processor for the Computation of Singular Values 



Matrix elements are represented by their subscripts. 

Figure 6* Systolic Array for the Computation of Singular 
Values with Input/Output Format 
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"along the way" by employing another rectangular array of 0(wm) processors. 
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